Task 1¶

$f(x_1, x_2) = (x_1 + x_2)^2$

Assume that $x_1, x_2 \sim U[-1, 1]$ and $x_1 = x_2$ (full dependence).

Calculating PD¶

$g^{PD}_{x_1}(z) = E_{X_{-x_1}}[f(X_{x_1|=z})] = E_{X_{-x_1}}[z^2 + 2zX_2 + X_2^2] = z^2 + 2z \cdot 0 + \frac{1}{3} = z^2 + \frac{1}{3}$

since $E[U[-1, 1]] = 0$ and $E[U[-1, 1]^2] = \frac{1}{3}$. Note that PD averages over the marginal distribution of $X_2$, so the $x_1 = x_2$ dependence is ignored here by construction.
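As a quick sanity check, both moments and the PD value at a single point can be verified by Monte Carlo. A minimal sketch (the sample size and the test point z = 0.5 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(-1, 1, 1_000_000)  # X2 ~ U[-1, 1]
print(x2.mean(), (x2 ** 2).mean())  # approx. 0 and 1/3

z = 0.5
print(((z + x2) ** 2).mean())       # approx. z^2 + 1/3 = 0.5833...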

Calculating MP¶

Since $x_1 = x_2$, conditioning on $x_1 = z$ forces $x_2 = z$:

$g^{MP}_{x_1}(z) = E_{X_{-x_1}|x_1=z}[f(X_{x_1|=z})] = E_{X_{-x_1}|x_1=z}[z^2 + 2zX_2 + X_2^2] = z^2 + 2z^2 + z^2 = 4z^2$

Calculating ALE¶

$g^{AL}_{x_1}(z) = \int^{z}_{-1} E_{X_{-x_1}|x_1=v}\left[\frac{\partial f(x)}{\partial x_1}\right] dv = \int^{z}_{-1} E_{X_{-x_1}|x_1=v}[2x_1 + 2x_2] \, dv = \int^{z}_{-1} 4v \, dv = 2z^2 - 2$
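The MP and ALE curves can be checked the same way. Because conditioning on $x_1 = v$ pins $x_2 = v$, the conditional expectations reduce to plugging in $v$; below is a minimal sketch comparing the closed forms against a numeric accumulation of the local effects (the grid resolution is an arbitrary choice):

import numpy as np

z_grid = np.linspace(-1, 1, 201)

mp = (2 * z_grid) ** 2            # MP: f(z, z) = (2z)^2 = 4z^2
ale_closed = 2 * z_grid ** 2 - 2  # ALE closed form derived above

# accumulate the local effect E[2*x1 + 2*x2 | x1 = v] = 4v with the midpoint rule
mid = (z_grid[:-1] + z_grid[1:]) / 2
ale_num = np.concatenate([[0.0], np.cumsum(4 * mid * np.diff(z_grid))])
print(np.max(np.abs(ale_closed - ale_num)))  # ~0; the midpoint rule is exact for linear integrands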

Task 2¶

We will be investigating the same dataset as in HW1: spambase.csv from the OpenML-100 suite. The dataset concerns emails, of which roughly 39% were classified as spam, while the rest were work and personal emails. We will train a Random Forest classifier and an XGBoost classifier.

Below are 2 random entries of the dataframe together with the predictions of both models.

Observation     True label   RF prediction   XGBoost prediction
Observation 4   1            0.9524          0.986
Observation 5   1            0.2884          0.635

As we can see, the RF predicted the true label for the first observation but not for the second (0.2884 falls below the 0.5 threshold). XGBoost fared better and classified both observations correctly.

Ceteris Paribus profiles¶

Below is a chart showing the CP profiles of the 2 observations, along 2 different variables:

[chart1: CP profiles of observations 4 and 5 for char_freq_%24 and capital_run_length_average]

Comment:

  1. The CP profiles are non-linear and non-monotonic - to be expected, since our model is tree-based (a minimal sketch of what a CP profile computes follows this list).
  2. Most of the variability happens around 0, which could be expected, as word-frequency values usually hover around 0.
  3. For char_freq_%24, one observation's profile is constant for x>0.1, whereas the second observation's profile still varies there (it first drops slightly around x=0.5 and then rises slightly around x=1).
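For intuition, a CP profile is nothing more than the model's prediction as a single feature of one observation is swept over a grid while everything else is held fixed. Below is a minimal sketch of that computation, reusing RF_final and X from the appendix (the cp_profile helper and the grid range are our own illustrative choices, not dalex's API):

import numpy as np

def cp_profile(model, X, i, feature, grid):
    # repeat observation i once per grid point, overwrite the chosen feature, predict
    rows = X.iloc[[i] * len(grid)].copy()
    rows[feature] = grid
    return model.predict_proba(rows)[:, 1]

grid = np.linspace(0, 1, 50)
profile = cp_profile(RF_final, X, 4, "char_freq_%24", grid)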

CP vs PDP¶

Below is a chart with the PDPs of the same 2 variables, overlaid with the individual CP profiles:

[chart2: PDPs of char_freq_%24 and capital_run_length_average with individual CP profiles]

Comment:

  1. We can see that the PDP is an average of the individual CP profiles (reproduced numerically in the sketch after this list).
  2. For capital_run_length_average, around half of the profiles (the lines at the bottom of the chart) have a jump around x=150. However, since the PDP is an average and the remaining profiles are constant at this point, the jump is much less visible in the aggregate.
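Point 1 can be reproduced directly: compute CP profiles for a sample of observations and take their mean. A sketch reusing the hypothetical cp_profile helper from the previous section (the sample size and grid are arbitrary; dalex does its own sampling internally, so the curves will match only approximately):

import numpy as np

grid_crl = np.linspace(1, 300, 50)
idx = np.random.default_rng(1).choice(len(X), size=100, replace=False)
pdp_manual = np.mean(
    [cp_profile(RF_final, X, i, "capital_run_length_average", grid_crl) for i in idx],
    axis=0,
)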

PDP RF vs XGBoost¶

Below is a chart comparing the PDPs of our Random Forest and XGBoost models:

[chart3: PDP comparison, RF vs XGBoost, for char_freq_%24 and capital_run_length_average]

Comment:

  1. In this case, our XGBoost PDPs are much flatter - potentially because we trained a small XGBoost model, with depth=2 and n_estimators=50 (the RF had depth=8 and n_estimators=200). A quick way to quantify this is sketched after this list.
  2. The PDP varies more for our RF on the two variables shown - meaning changes in them affect the RF prediction much more.
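The flatness observation can be quantified from the dalex objects built in the appendix, e.g. by comparing the range each PD curve spans. A sketch assuming the `_vname_` and `_yhat_` columns of the profiles' result dataframes:

for prof in (pdp, pdp_XG):
    spans = prof.result.groupby("_vname_")["_yhat_"].agg(lambda s: s.max() - s.min())
    print(spans.loc[["char_freq_%24", "capital_run_length_average"]])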

Appendix¶

Data preparation¶

In [5]:
import numpy as np
import pandas as pd
import dalex as dx
import lime

spambase = pd.read_csv("spambase.csv")
In [6]:
df = spambase.drop(spambase.columns[0], axis=1) #Cleaning first column which is just index
In [7]:
df.describe()
Out[7]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average TARGET
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.005444 0.031869 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 0.394045
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.076274 0.285735 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 0.488698
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 0.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.000000 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 1.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 2.170000 10.000000 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 1.000000

8 rows × 56 columns

In [8]:
X = df.loc[:, df.columns != 'TARGET']
In [9]:
y = df.loc[:, df.columns == 'TARGET'] #Note: this yields an (n, 1) DataFrame; df['TARGET'] would avoid sklearn's DataConversionWarning below
In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)
In [11]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits = 5)

Random Forest¶

In [12]:
from sklearn.ensemble import RandomForestClassifier
In [13]:
RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X, y)
print("Train accuracy: ", accuracy_score(y, RF_final.predict(X)))
C:\Users\Antek\AppData\Local\Temp\ipykernel_19912\3443453795.py:1: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X, y)
Train accuracy:  0.9565311888719844

Dalex CP¶

In [29]:
RFexplainer = dx.Explainer(RF_final, X, y)

cp = RFexplainer.predict_profile(new_observation=X.iloc[4:6])
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning:

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

Preparation of a new explainer is initiated

  -> data              : 4601 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 4601 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000001E17A372940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0118, mean = 0.394, max = 0.992
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.902, mean = 7.09e-05, max = 0.947
  -> model_info        : package sklearn

A new explainer has been created!
Calculating ceteris paribus: 100%|██████████| 55/55 [00:01<00:00, 51.93it/s]
In [31]:
y.iloc[4:6]
Out[31]:
TARGET
4 1
5 1
In [32]:
observations = X.iloc[4:6]
print(RFexplainer.predict(observations))
[0.95239126 0.28840994]
In [30]:
cp.plot(variables=["char_freq_%24", "capital_run_length_average"])
In [23]:
pdp = RFexplainer.model_profile()
Calculating ceteris paribus: 100%|██████████| 55/55 [00:15<00:00,  3.52it/s]
In [33]:
pdp.plot(variables=["char_freq_%24", "capital_run_length_average"], geom="profiles", title="Partial Dependence Plot with individual profiles")

XGBoost¶

In [36]:
import xgboost
In [37]:
model = xgboost.XGBClassifier(
    n_estimators=50,
    max_depth=2,
    use_label_encoder=False,
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X, y)
C:\Users\Antek\anaconda3\lib\site-packages\xgboost\sklearn.py:1395: UserWarning:

`use_label_encoder` is deprecated in 1.7.0.

Out[37]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=2,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=50, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)
In [38]:
def pf_xgboost_classifier_categorical(model, df):
    # cast object columns to 'category' so that XGBoost's enable_categorical can handle them,
    # then return the predicted probability of the positive (spam) class
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]
In [39]:
XGexplainer = dx.Explainer(model, X, y, predict_function=pf_xgboost_classifier_categorical)
Preparation of a new explainer is initiated

  -> data              : 4601 rows 55 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 4601 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x000001E10484B5E0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 2.68e-05, mean = 0.394, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.997, mean = -2.73e-05, max = 0.993
  -> model_info        : package xgboost

A new explainer has been created!
In [41]:
pd.concat([RFexplainer.model_performance().result, XGexplainer.model_performance().result])
Out[41]:
recall precision f1 accuracy auc
RandomForestClassifier 0.911748 0.976373 0.942955 0.956531 0.988882
XGBClassifier 0.926089 0.955606 0.940616 0.953923 0.988560
In [42]:
pdp_XG = XGexplainer.model_profile()
Calculating ceteris paribus: 100%|██████████| 55/55 [00:01<00:00, 39.10it/s]
In [43]:
pdp_XG.plot(pdp, variables=["char_freq_%24", "capital_run_length_average"], title = "PDP comparison")
In [44]:
print(XGexplainer.predict(observations))
[0.98615324 0.6348719 ]